
Journal of Bioinformatics and Systems Biology

Fortune Journals

Preprints posted in the last 30 days, ranked by how well they match Journal of Bioinformatics and Systems Biology's content profile, based on 14 papers previously published here. The average preprint has a 0.03% match score for this journal, so anything above that is already an above-average fit.

1
Evaluating the reliability of tools for mRNA annotation and IRES studies

May, G. E.; Akirtava, C.; McManus, J.

2026-03-31 genomics 10.64898/2026.03.29.707813 medRxiv
Top 0.1%
2.8%

Since the discovery of viral Internal Ribosome Entry Sites (IRESes), researchers have sought to find similar elements in mammalian host genes, termed "cellular IRESes". However, the plasmid systems used to measure cellular IRES activity are vulnerable to false positives due to promoter activity in candidate IRESes. Orthogonal methods are needed to validate putative IRESes while carefully avoiding artifacts known to cause false positives. Recently, Koch et al. proposed approaches for studying IRESes, primarily circular RNA-generating plasmids, and for validating mRNA transcripts using smFISH and qRT-PCR. Here, we demonstrate confounding variables and artifacts in each of these approaches that can lead to inappropriate conclusions about potential cellular IRES activity. We show the back-splicing circRNA plasmid creates linear mRNA artifacts associated with false-positive IRES signals. Using orthogonal, gold-standard assays validated with viral IRESes, we find putative cellular IRESes reported using the back-splicing plasmid have no IRES activity. Furthermore, we demonstrate that smFISH and qRT-PCR can misidentify nuclear non-coding RNAs as mRNAs and we validate a single-molecule sequencing assay for identifying genuine mRNA 5′ ends. Our work establishes reliable methods for robust transcript annotation and IRES studies that avoid documented artifacts arising from bicistronic and back-splicing circRNA plasmid reporters.

2
Multistage Machine Learning Reveals Circadian Gene Programs and Supports a Retina-Choroid Axis in Myopia Development

Watcharapalakorn, A.; Poyomtip, T.; Tawonkasiwattanakun, P.; Dewi, P. K. K.; Thomrongsuwannakij, T.; Mahawan, T.

2026-04-06 bioinformatics 10.64898/2026.04.02.716020 medRxiv
Top 0.4%
0.9%

Purpose: To determine whether circadian timing defines critical molecular windows in myopia development and to assess the transferability of circadian gene programs across ocular tissues, disease stages, and species. Methods: Publicly available retinal and choroidal RNA-seq datasets from chick models of form-deprivation myopia were analyzed using unsupervised transcriptomic profiling and multistage machine-learning classification. Circadian windows were defined based on Zeitgeber time, and samples were grouped accordingly for downstream analyses. Classification model robustness was evaluated through cross-tissue and cross-stage validation and further assessed using external validation in an independent dataset. Functional translation to humans was examined using ortholog-based Gene Ontology enrichment analysis to identify conserved biological processes and higher-order regulatory pathways. Results: A circadian critical window at ZT8-ZT12 exhibited the strongest transcriptional divergence during both myopia onset and progression. Gene signatures derived from this window generalized across retina and choroid and remained predictive across disease stages, supporting coordinated molecular regulation between ocular tissues. External validation confirmed the reproducibility of these signatures despite differences in experimental design and gene coverage. Functional mapping revealed that conserved molecular components in chicks are reorganized into more complex neuroendocrine and regulatory networks in humans, indicating cross-species conservation with increased functional complexity. Conclusions: Circadian timing strongly shapes myopia-related gene expression and underlies coordinated retina-choroid signaling. These findings highlight circadian biology as a key factor of refractive development and suggest that time-dependent mechanisms may influence myopia susceptibility, progression, and response to treatment.

3
Clarified an rDNA Gene Unit Pattern with (CTTT)n and (CT)n Microsatellites Aggregation Ahead of and Behind the Gene in Human Genome

Shen, J.; Tang, S.; Xia, Y.; Qin, J.; Xu, H.; Tan, Z.

2026-03-24 genetics 10.64898/2026.03.22.713381 medRxiv
Top 0.6%
0.8%

Background: Conventional models of human ribosomal DNA (rDNA) array organization have historically depended on transcription-centric boundaries, partitioning the unit into a ~13 kb rDNA transcription region and a monolithic ~31 kb intergenic spacer (IGS). While our previous identification of Duplication Segment Units (DSUs) mapped these arrays based on an intuitive analysis of the microsatellite density landscape of the complete reference human genome, our present deep mining of this landscape has revealed a more accurate rDNA Gene Unit Pattern. Methods & Results: In this study, we conducted a deep mining analysis of our previously established microsatellite density landscape of the T2T-CHM13 assembly, focusing specifically on nucleolar organizing regions (NORs). We suggest a more accurate rDNA Gene Unit Pattern containing a (CTTT)n microsatellite aggregation ahead of the rDNA gene and a (CT)n microsatellite aggregation behind the gene, rather than a pattern featuring an IGS region inserted between two rDNA genes. Conclusions: A correct rDNA gene pattern of the human genome probably includes a (CTTT)n microsatellite aggregation ahead of the gene and a (CT)n microsatellite aggregation behind it, which possibly constitute cis- and trans-regulating regions; the (CTTT)n and (CT)n microsatellite aggregations may provide two different local stable DNA structures for regulatory protein binding.

4
High prevalence of loss of Y chromosome in the spermatozoa of young cancer survivors

Axelsson, J.; Bruhn-Olszewska, B.; Sarkysian, D.; Markljung, E.; Horbacz, M.; Pla, I.; Sanchez, A.; Nenonen, H.; Elenkov, A.; Dumanski, J. P.; Giwercman, A.

2026-03-23 genetic and genomic medicine 10.64898/2026.03.20.26348822 medRxiv
Top 0.7%
0.7%

Cancer-related genomic instability (GI) may cause genetic alterations in spermatozoa, implying health issues not only in cancer survivors, but also in their children [1, 2]. We therefore studied loss of Y chromosome (LOY), considered a hallmark of GI [3-15], in spermatozoa and blood from survivors of childhood and testicular cancer (CC, TC), and controls (CTRL). We found that LOY was statistically significantly more frequent in spermatozoa from cancer survivors than in controls (Odds Ratio [OR]=2.2 for CC vs. CTRL and OR=2.4 for TC vs. CTRL). Furthermore, LOY was about an order of magnitude more prevalent in spermatozoa than in blood among 18-53-year-old males within all cohorts. Our findings suggest that LOY in spermatozoa might be a clinically useful marker of GI, reduced fertility and disease predisposition in males. Introducing LOY in spermatozoa as a biomarker opens a new research avenue into disease prevention and the causes and consequences of LOY.

5
BrightEyes-FFS: an open-source platform for comprehensive analysis of fluorescence fluctuation spectroscopy experiments with small detector arrays

Slenders, E.; Perego, E.; Zappone, S.; Vicidomini, G.

2026-04-10 bioinformatics 10.64898/2026.04.08.717207 medRxiv
Top 0.7%
0.7%

Fluorescence fluctuation spectroscopy (FFS) is an ensemble of techniques for quantitative measurement of molecular dynamics and interactions. Recently, the introduction of small-format array detectors has opened up a new range of spatiotemporal information, allowing for more detailed analysis of system kinetics. However, there is currently no open-source software available for analyzing the high-dimensional FFS data sets. We present BrightEyes-FFS, an open-source Python-based environment for FFS analysis with array detectors. The environment includes a Python package for reading raw FFS data, computing auto- and cross-correlations using various algorithms, and fitting the correlations to several models. A graphical user interface (GUI), available as a standalone executable, makes the analysis fast and user-friendly. An automated Jupyter Notebook writing tool enables transition from the GUI to Jupyter Notebook for custom analysis. We believe that BrightEyes-FFS will enable a wider community to study diffusion, flow, and interaction dynamics.

6
Cleavage specificity of E. coli YicC endoribonuclease

Barnes, S. A.; Lazarus, M. B.; Bechhofer, D. H.

2026-03-26 molecular biology 10.64898/2026.03.25.714237 medRxiv
Top 0.8%
0.7%

Escherichia coli YicC enzyme is the founding member of a family of endoribonucleases that is encoded in virtually all bacterial species. Previous structural studies revealed that this ribonuclease binds RNA by a novel mechanism in which the hexameric apoprotein presents an open channel that undergoes a large rotation upon RNA binding and clamps down on the RNA. The current study follows up on these findings by examining the cleavage of various oligonucleotide substrates designed to probe recognition elements required for YicC binding and cleavage. A 26-nucleotide RNA oligomer (oligo), with a KD in the low micromolar range, was the standard to which numerous oligos with altered sequence were compared. In vitro RNase assays and fluorescence anisotropy binding measurements indicated that the preferred substrates for YicC were relatively small RNAs that contain some secondary structure. Larger RNAs or highly structured RNAs were less-than-optimal substrates. Similarly, RyhB RNA, a ~90-nucleotide, iron-responsive RNA of E. coli, which has been described as a target of YicC binding and/or cleavage, was a poor YicC substrate in our assays. These results suggest that the native substrates for YicC-family members are very small RNAs or RNA fragments derived from larger RNAs.

7
Correlate: A Web Application for Analyzing Gene Sets and Exploring Gene Dependencies Using CRISPR Screen Data

Deolankar, S.; Wermeling, F.

2026-04-04 bioinformatics 10.64898/2026.04.02.716070 medRxiv
Top 0.9%
0.7%

CRISPR screen data provides a valuable resource for understanding gene function and identifying potential drug targets. Here, we present Correlate, a freely accessible web application (https://correlate.cmm.se) that enables exploration of the Cancer Dependency Map (DepMap) CRISPR screen gene effects, hotspot mutations, and translocation/fusion data across more than 1,000 human cancer cell lines. The application supports two main use cases: (i) analysis of user-defined gene sets (e.g. CRISPR screen hits) to identify functionally linked genes based on correlations while providing an overview based on essentiality or user-provided screen statistics; and (ii) exploration of genes of interest in defined biological contexts, such as specific cancer types or mutational backgrounds, to generate hypotheses about gene function and dependencies. Additionally, Correlate supports experimental design by providing rapid overviews of gene essentiality and enabling the identification of cell lines with relevant mutational profiles. In contrast to knowledge-based approaches such as STRING and GSEA, which rely on prior biological annotations and curated interaction networks, Correlate identifies gene connections directly from functional CRISPR screen readouts, offering a complementary and data-driven perspective on gene network analysis. The application runs entirely in the browser, requires no installation or login, and integrates with the Green Listed v2.0 tool family for custom CRISPR screen design.
Highlights:
■ Interactive web-based platform for bulk correlation analysis of user-defined gene sets using DepMap CRISPR screen data, requiring no installation or programming expertise.
■ Identifies functional gene relationships from CRISPR screen readouts rather than curated annotations, offering a data-driven complement to tools such as GSEA and STRING.
■ Enables contextual exploration of gene dependencies across cancer types and mutational backgrounds, supporting hypothesis generation about gene function and therapeutic targets.
■ Supports experimental design through gene essentiality overviews, mutation and fusion analysis, and cell line identification, with optional integration of user-provided statistics from CRISPR screens, proteomics, or transcriptomics analyses.

8
Identification of a microRNA with a mutation in the loop structure in the silkworm Bombyx mori

Harada, M.; Tabara, M.; Kuriyama, K.; Ito, K.; Bono, H.; Sakamoto, T.; Nakano, M.; Fukuhara, T.; Toyoda, A.; Fujiyama, A.; Tabunoki, H.

2026-03-27 molecular biology 10.64898/2026.03.24.714027 medRxiv
Top 0.9%
0.7%

MicroRNAs (miRNAs) play essential roles in the posttranscriptional regulation of gene expression in organisms. In the process of synthesizing mature miRNAs from miRNA precursors, the miRNA precursors are cleaved via Dicer at their loop structure, after which the miRNA precursors become mature and regulate transcription. However, the consequences of altering the loop sequence are not fully understood. The silkworm Bombyx mori is a lepidopteran insect with many genetic strains. We identified a mutant of the miRNA miR-3260 in which part of the loop structure was lacking in a silkworm strain with translucent larval skin. Here, we aimed to analyze the role of wild-type miR-3260 and the influence of the mutation of the loop structure in B. mori. First, we identified the genomic region responsible for the translucent larval skin phenotype and determined the mutated miR-3260 nucleotide sequence. Then, we predicted the binding partners of wild-type miR-3260 using the RNA hybrid tool and found two juvenile hormone (JH)-related genes as targets of wild-type miR-3260. Next, we assessed the relationships between miR-3260 and JH and found that miR-3260 was highly expressed in the corpora allata and its expression responded to JH treatment. Meanwhile, miR-3260 mimic and inhibitor did not induce the typical phenotypes associated with JH in B. mori. Then, we compared the dicing products from wild-type and mutant miR-3260 precursors and observed that neither form underwent Dicer-mediated cleavage when the loop structure was altered. These results suggest that loop mutations in the miR-3260 precursor may not influence dicing activity, consistent with the lack of observable phenotypic effects.

9
Analysis of biological networks using Krylov subspace trajectories

Frost, H. R.

2026-03-31 bioinformatics 10.64898/2026.03.29.715092 medRxiv
Top 0.9%
0.7%

We describe an approach for analyzing biological networks using rows of the Krylov subspace of the adjacency matrix. Specifically, we explore the scenario where the Krylov subspace matrix is computed via power iteration using a non-random and potentially non-uniform initial vector that captures a specific biological state or perturbation. In this case, the rows of the Krylov subspace matrix (i.e., Krylov trajectories) carry important functional information about the network nodes in the biological context represented by the initial vector. We demonstrate the utility of this approach for community detection and perturbation analysis using the C. elegans neural network.
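
The construction described in this abstract can be sketched in a few lines. This is a minimal illustration and not the author's implementation: the function name `krylov_trajectories`, the per-step L2 normalization, and the toy three-node graph are all assumptions made for the example.

```python
from typing import List


def krylov_trajectories(
    adj: List[List[float]], v0: List[float], steps: int
) -> List[List[float]]:
    """Return an n x (steps + 1) matrix whose columns are v0, A v0, A^2 v0, ...

    Each column is rescaled to unit L2 norm (an assumption of this sketch,
    as in standard power iteration). Row i is the trajectory of node i.
    """
    n = len(adj)

    def normalize(v: List[float]) -> List[float]:
        norm = sum(x * x for x in v) ** 0.5
        return [x / norm for x in v] if norm > 0 else v[:]

    v = normalize(v0)
    columns = [v]
    for _ in range(steps):
        # One power-iteration step: multiply by the adjacency matrix.
        v = normalize([sum(adj[i][j] * v[j] for j in range(n)) for i in range(n)])
        columns.append(v)
    # Transpose so that each row holds one node's trajectory.
    return [[col[i] for col in columns] for i in range(n)]


# Hypothetical toy network: an undirected path 0 - 1 - 2.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# Non-uniform initial vector: all of the "perturbation" starts at node 0.
toy_traj = krylov_trajectories(toy_adj, [1, 0, 0], steps=2)
```

In this toy case the signal reaches node 1 after one step and node 2 after two, so the rows differ by how quickly each node responds to the chosen starting state.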

10
Integrative Transcriptomic and Machine Learning Analysis of ecDNA-Associated Features for Studying Chemotherapy Resistance in TNBC

Iftehimul, M.; Saha, D.

2026-04-06 cancer biology 10.64898/2026.04.02.716106 medRxiv
Top 0.9%
0.7%

Extrachromosomal DNA (ecDNA) has emerged as a critical mediator of oncogene amplification and transcriptional dynamics in aggressive cancers, yet its contribution to chemotherapy resistance in vivo remains incompletely understood. This study investigates the contribution of ecDNA-associated molecular features to predicting chemotherapy resistance in TNBC. We analyzed RNA-seq data from 4T1 TNBC cells and 4T1 bulk tumors at different growth stages (1-, 3-, and 6-week) to identify differentially expressed ecDNA alterations. We then utilized molecular docking tools to predict ecDNA protein-drug interactions and employed machine learning (ML) models to predict ecDNA-associated therapeutic resistance. Our results revealed changes in global gene expression, including expression of ecDNA-associated genes, that continued over time, with significant molecular remodeling observed at six weeks. Additionally, we found gradual accumulation of mutations in ecDNA genes, which may have contributed to reduced drug binding affinity, indicating potential resistance. ML models generated stable, high-confidence classifications of resistant phenotypes, consistently identifying ecDNA burden and prevalence as dominant predictive features of drug resistance. Drug-specific predictions further highlighted elevated resistance probabilities for paclitaxel and doxorubicin, whereas hydroxyurea, which depletes ecDNA, showed reduced resistance probabilities, indicating potential roles of ecDNA in chemoresistance. This study provides new insights into temporal remodeling of ecDNA within TNBC tumors and its potential association with drug resistance.

11
Barcode Crosstalk in ONT Multiplex Sequencing: Quantification and Mitigation Strategies

Scharf, S. A.; Spohr, P.; Ried, M. J.; Haas, R.; Klau, G. W.; Henrich, B.; Pfeffer, K.

2026-03-28 molecular biology 10.64898/2026.03.27.714689 medRxiv
Top 0.9%
0.6%

Multiplexing samples in long-read sequencing with Oxford Nanopore Next Generation Sequencing Technology (ONT) by ligating specific native barcodes to individual DNA samples significantly increases sequencing throughput while significantly reducing sequencing costs. However, this advantage carries the risk of barcode misassignment, or crosstalk. When employing ONT multiplex sequencing, we observed misassigned barcodes, so-called barcode crosstalk, after ONT library preparation according to the standard protocol, particularly in samples with low input DNA concentrations. We assumed that these barcode misassignments are largely due to misligation of remaining native barcodes during the subsequent sequencing adapter ligation. To systematically investigate and quantify barcode crosstalk, genomic DNA (gDNA) from four bacterial type strains with different DNA input concentrations was prepared using three protocols for library preparation: the Nanopore standard protocol (protocol A: version valid until July 2, 2025), the new Nanopore protocol (protocol B: version from July 2, 2025), and an in-house protocol with pooling of the barcoded samples only after the sequencing adapter ligation step (protocol C: in house). All samples were sequenced on a Nanopore PromethION device. The results clearly showed that the use of protocol A resulted in pronounced barcode crosstalk, especially detectable in samples with low DNA input concentrations (up to 2.4% misassigned reads). The ONT adjustment in protocol B (altered washing buffer vs. protocol A) significantly alleviated the barcode crosstalk to below 0.01%, whereas protocol C eliminated barcode crosstalk virtually completely. These observations emphasize that sequencing results obtained with older ONT native barcoding protocol variants should be critically reviewed. The newer ONT barcoding protocol is preferable for sequencing, but it does not completely eliminate the barcode crosstalk effect. In conclusion, for low DNA input and high accuracy sequencing, protocol C is recommended.

12
Benchmark of biomarker identification and prognostic modeling methods on diverse censored data

Fletcher, W. L.; Sinha, S.

2026-04-01 bioinformatics 10.64898/2026.03.29.715113 medRxiv
Top 0.9%
0.6%

The practices of identifying biomarkers and developing prognostic models using genomic data have become increasingly prevalent. Such data often features characteristics that make these practices difficult, namely high dimensionality, correlations between predictors, and sparsity. Many modern methods have been developed to address these problematic characteristics while performing feature selection and prognostic modeling, but a large-scale comparison of their performances in these tasks on diverse right-censored time to event data (aka survival time data) is much needed. We have compiled many existing methods, including some machine learning methods, several of which have performed well in previous benchmarks, primarily for comparison with regard to variable selection capability, and secondarily for survival time prediction on many synthetic datasets with varying levels of sparsity, correlation between predictors, and signal strength of informative predictors. For illustration, we have also performed multiple analyses on a publicly available and widely used cancer cohort from The Cancer Genome Atlas using these methods. We evaluated the methods through extensive simulation studies in terms of the false discovery rate, F1-score, concordance index, Brier score, root mean square error, and computation time. Of the methods compared, CoxBoost and the Adaptive LASSO performed well in all metrics, and the LASSO and elastic net excelled when evaluating concordance index and F1-score. The Benjamini-Hochberg and q-value procedures showed volatile performances in controlling the false discovery rate. Some methods' performances were greatly affected by differences in the data characteristics. With our extensive numerical study, we have identified the best performing methods for a plethora of data characteristics using informative metrics. This will help cancer researchers in choosing the best approach for their needs when working with genomic data.
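
For readers unfamiliar with the false-discovery-rate procedure named in this abstract, the following is a minimal sketch of the standard Benjamini-Hochberg step-up rule. It is not taken from the benchmark's code; the function name `benjamini_hochberg` and the example p-values are hypothetical.

```python
from typing import List


def benjamini_hochberg(pvalues: List[float], alpha: float = 0.05) -> List[bool]:
    """Return a reject/keep flag per hypothesis, in the original input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    # Step-up rule: find the largest rank k with p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if pvalues[idx] <= rank / m * alpha:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below that cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Hypothetical p-values for eight candidate biomarkers.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
example_flags = benjamini_hochberg(example_p, alpha=0.05)
```

Because the rule is step-up, a p-value that misses its own threshold can still be rejected when a larger p-value further down the ranking meets its threshold.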

13
Correlation Between Information Entropy and Functions of Gene Sequences in the Evolutionary Context: A New Way to Construct Gene Regulatory Networks from Sequence

Pan, L.; Chen, M.; Tanik, M.

2026-04-07 bioinformatics 10.64898/2026.04.03.714856 medRxiv
Top 1.0%
0.6%

The information encoded in DNA sequences can be rigorously quantified using Shannon entropy and related measures. When placed in an evolutionary context, this quantification offers a principled yet underexplored route to constructing gene regulatory networks (GRNs) directly from sequence data. While most GRN inference methods rely exclusively on gene expression profiles, the regulatory code is ultimately written in the DNA sequence itself. Here we review the mathematical foundations of information theory as applied to gene sequences, survey existing computational methods for GRN inference--with emphasis on information-theoretic and sequence-based approaches--and examine how evolutionary conservation constrains sequence entropy to preserve biological function. We then propose a four-layer integrative framework that combines per-position Shannon entropy profiles, evolutionary conservation scoring via Jensen-Shannon divergence, expression-based mutual information and transfer entropy, and DNA foundation model embeddings to construct GRNs from sequence. Through worked examples on the Escherichia coli SOS regulatory sub-network, we demonstrate how conservation-weighted mutual information improves edge discrimination and how transfer entropy resolves regulatory directionality. The framework generates testable predictions: edges supported by low-entropy regulatory regions should show higher experimental validation rates, and cross-species entropy profile conservation should predict GRN topology conservation. This work bridges three scales of biological information--nucleotide-level entropy, evolutionary constraint patterns, and network-level regulatory logic--establishing information entropy as the natural mathematical language for sequence-to-network regulatory inference.
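
The first layer of the framework, a per-position Shannon entropy profile, can be illustrated as below. This is only a sketch under stated assumptions: the function names, the toy alignment, and the choice to treat every character (including any gap symbol) as an ordinary symbol are mine, not the authors'.

```python
import math
from collections import Counter
from typing import List


def column_entropy(column: str) -> float:
    """Shannon entropy in bits of the symbols in one alignment column."""
    counts = Counter(column)
    total = len(column)
    # H = sum over symbols of p * log2(1 / p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def entropy_profile(alignment: List[str]) -> List[float]:
    """Per-position entropy for equal-length aligned sequences."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [
        column_entropy("".join(seq[i] for seq in alignment))
        for i in range(length)
    ]


# Hypothetical four-sequence alignment, not real data.
toy_alignment = ["ACGT", "ACGA", "ACCC", "ACTG"]
toy_profile = entropy_profile(toy_alignment)
```

Fully conserved columns score 0 bits and a column with four equally frequent nucleotides scores 2 bits, which is why low-entropy positions are read as candidates for functional constraint.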

14
Granularity screening identifies candidate genes involved in vaccinia virus induced LC3 lipidation

Yakimovich, A.; Krause, M.; Vago, N.; Drexler, I.; Mercer, J.

2026-03-30 cell biology 10.64898/2026.03.26.714436 medRxiv
Top 1%
0.5%

Autophagy is a catabolic process used for the degradation of organelles and proteins. Macroautophagy involves the formation of autophagosomes and subsequent fusion with lysosomes to mediate cargo degradation. It also functions as a cellular defence mechanism, known as xenophagy, during infection. Previous studies show that different viruses manipulate the autophagy pathway of the host cell to assure successful replication and/or virion assembly. Vaccinia virus (VACV), the prototypic poxvirus, replicates exclusively in the cytoplasm of host cells. It is known that VACV infection causes LC3 lipidation and prevents autophagosome formation, yet the double membrane vesicles formed during autophagy do not serve as the source of the mature VACV membrane. To date the viral protein(s) causing increased LC3 lipidation have not been identified. Here we developed an image-based screening approach based on LC3 granularity to identify candidate VACV genes affecting its lipidation. We identify several candidate viral membrane proteins as effectors of LC3 lipidation, suggesting that the interplay between VACV and autophagy is more directed than previously thought.

15
Functional Exploration of African Colorectal Cancer Patients Using Personalised Drosophila Avatars

Oladokun, F. A.; Oladokun, F. A.; Ajayi, A. A.; Ibrahim, A.; Aladeloye, R. S.; Akinfe, O. A.; Oludaiye, F. R.; Moens, T.; Badmos, H.; Abolaji, A. O.; Cagan, R. L.

2026-03-30 cancer biology 10.64898/2026.03.26.714433 medRxiv
Top 1%
0.5%

Colorectal cancer across sub-Saharan Africa presents a growing global health burden, with increasing cases and mortality linked to late diagnosis, limited healthcare access and lack of effective treatments. African patients typically present with aggressive disease marked by distinct genomic signatures, indicating the need for targeted treatment approaches. We integrated genetic modelling, phenotypic scoring, imaging and biochemical analysis to explore how mutations found in individual Nigerian colorectal cancer patients influence drug responsiveness. We used data from The Cancer Genome Atlas to identify mutation profiles specific to Nigerian patients. We then generated ten stable Drosophila melanogaster personalised patient avatar lines designed to model patient genomic profiles. This study focused on three lines; each line included oncogenic RAS plus targeting patient-specific variants. These models exhibited various phenotypes including altered larval size, gut size and reduced survival. Two of the three avatar lines showed improved survival, reduced hindgut proliferation zone expansion and restored redox balance after treatment with regorafenib and trametinib. Mirroring clinical patient responses, we found that response to therapy is dependent on the specific genetic profile of the tumour.
[Graphical abstract: Figure 1]
- African colorectal cancer showed distinct mutation patterns that contribute to tumour heterogeneity.
- Patient-derived Drosophila avatars were engineered using tumour-specific genetic mutations with key features of human colorectal cancer.
- Treatment with targeted therapies showed responses patterned by tumour genotype.
- Response patterns indicated the need for personalised colorectal cancer therapies among diverse populations.

16
FoldaVirus, a knowledge-based icosahedral capsid builder using AlphaFold

Rojas Labra, O.; Montoya-Munoz, D. S.; Santoyo-Rivera, N.; McDonald, J.; Montiel-Garcia, D.; Case, D. A.; Reddy, V. S.

2026-03-30 bioinformatics 10.64898/2026.03.27.714795 medRxiv
Top 1%
0.5%

The tertiary structures of coat proteins (CPs) and their capsid organization in spherical viruses are highly conserved within each virus family. While AlphaFold successfully predicts the tertiary structures of individual CPs, their association to form proper quaternary assemblies cannot be easily accomplished. Here, we report a generalized methodology and associated web-based utility (https://foldavirus.org) that combines AlphaFold predictions of CPs with knowledge of the corresponding icosahedral architectures (e.g., T=1, 3, 4...) based on the known structures from the same virus family to generate associated capsids. The resulting assemblies are subjected to Amber energy minimization to relieve any steric clashes at the inter-subunit interfaces. Significantly, the capsid models are validated by calculating a robust Mahalanobis distance using the residue annotations categorized as interface, core and surface amino acids with respect to those observed in the experimentally determined analogous structures. Given the amino acid sequence of CP(s), we successfully generated capsids up to T=9 icosahedral symmetry, including those of Picornaviruses that display pseudo-T=3 symmetry comprising different CPs. As the number of currently available CP sequences is 3-4 orders of magnitude larger than the number of experimentally determined 3D structures, this approach bridges the huge gap that exists between the corresponding sequence and structure space.

17
WayFindR: Investigating Feedback in Biological Pathways

Bombina, P.; McGee, R. L.; Reed, J.; Abrams, Z.; Abruzzo, L. V.; Coombes, K. R.

2026-03-31 bioinformatics 10.64898/2026.03.27.714788 medRxiv
Top 1%
0.5%

Understanding biological pathways requires more than static diagrams. We present WayFindR, an R package that converts pathway data from WikiPathways and KEGG into graph structures using igraph, enabling computational analysis of regulatory features such as negative feedback loops. Rooted in control theory, negative feedback is essential for system stability, yet it is often underrepresented in curated pathway data. In this study, we systematically analyzed pathway information from both databases across multiple species and found that feedback loops--particularly negative ones--are rarely captured. This gap likely reflects both biological and technical challenges. Biologically, feedback mechanisms are inherently complex and often remain uncharted due to limited experimental focus. Technically, pathway databases frequently lack standardized annotations and complete representations of regulatory interactions, especially inhibitory edges that are crucial for identifying feedback. These observations underscore the need for improved data curation and consistent annotation practices to enhance our understanding of regulatory dynamics. By bridging the gap between static pathway diagrams and dynamic systems-level insights, WayFindR enables reproducible and scalable investigation of feedback regulation in cellular networks. The WayFindR R package can be downloaded from the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/web/packages/WayFindR/index.html). The processed data along with code for download can be accessed via the GitLab repository (https://gitlab.com/krcoombes/wayfindr).
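
The idea of finding negative feedback loops in a pathway graph can be sketched independently of WayFindR. The example below is written in Python rather than R/igraph and does not reproduce the package's code or API; the function name, the edge encoding (+1 for activation, -1 for inhibition), and the toy pathway are assumptions. A loop is taken to be negative when the product of its edge signs is negative.

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, int]  # (target node, sign): +1 activation, -1 inhibition


def find_feedback_loops(graph: Dict[str, List[Edge]]) -> List[Tuple[List[str], int]]:
    """Enumerate simple directed cycles with the product of their edge signs.

    Exhaustive search, intended only for small illustrative graphs.
    """
    loops = []
    for start in sorted(graph):
        # Only extend paths through nodes that sort after `start`, so each
        # cycle is reported once, rooted at its smallest node.
        stack = [(start, [start], 1)]
        while stack:
            node, path, sign = stack.pop()
            for target, edge_sign in graph.get(node, []):
                if target == start:
                    loops.append((path, sign * edge_sign))
                elif target > start and target not in path:
                    stack.append((target, path + [target], sign * edge_sign))
    return loops


# Hypothetical pathway: A activates B, B activates C, C inhibits A;
# D and E activate each other.
toy_pathway = {
    "A": [("B", 1)],
    "B": [("C", 1)],
    "C": [("A", -1)],
    "D": [("E", 1)],
    "E": [("D", 1)],
}
toy_loops = find_feedback_loops(toy_pathway)
negative_loops = [path for path, sign in toy_loops if sign < 0]
```

The sketch also shows why missing inhibitory edges matter: if the C-to-A edge were recorded without its sign, the A-B-C loop could not be classified as negative.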

18
TF-IDF k-mer-based Classical and Hybrid Machine Learning Models for SARS-CoV-2 Variant Classification under Imbalanced Genomic Data

Haque, N.; Mazed, A.; Ankhi, J. N.; Uddin, M. J.

2026-04-02 bioinformatics 10.64898/2026.04.02.716024 medRxiv
Top 1%
0.5%

Accurate classification of SARS-CoV-2 genomic variants is essential for effective genomic surveillance, yet it is challenged by extreme class imbalance, limited representation of rare variants, and distribution shifts in real-world sequencing data. In this study, we employed a hybrid RF-SVM framework designed for robust detection of rare SARS-CoV-2 variants. It integrates a random forest and a polynomial-kernel support vector machine to enhance sensitivity to minority classes while maintaining overall predictive stability. We systematically compared classical machine learning models, deep learning approaches, and hybrid strategies under both standard and distribution-shifted evaluation settings. Our results show that classical models using TF-IDF-based k-mer features outperform deep learning methods on macro-averaged performance metrics. The Random Forest classifier using TF-IDF features achieved the best overall performance, with a macro-averaged F1-score of 0.8894 and an accuracy of 96.3%. The model also demonstrated strong generalization ability, as evidenced by stable cross-validation performance (CV accuracy = 0.9637). The hybrid RF-SVM model further improves rare variant detection under severe class imbalance. Calibration analysis indicates reliable probability estimates for common variants, although challenges persist for minority classes. Overall, this study highlights the limitations of deep learning in highly imbalanced genomic settings and demonstrates that carefully designed hybrid machine learning approaches provide an effective and interpretable solution for rare SARS-CoV-2 variant detection.
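
As a rough illustration of what TF-IDF-based k-mer features look like, the sketch below treats each sequence as a document and each overlapping k-mer as a term. It is not the authors' pipeline: the function names, k = 3, the unsmoothed weighting tf × ln(N / df), and the toy sequences are assumptions, and library implementations often use a smoothed IDF instead.

```python
import math
from collections import Counter
from typing import Dict, List


def kmers(seq: str, k: int) -> List[str]:
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def tfidf_kmer_features(seqs: List[str], k: int = 3) -> List[Dict[str, float]]:
    """One sparse {k-mer: weight} dict per sequence."""
    docs = [Counter(kmers(s, k)) for s in seqs]
    n_docs = len(docs)
    # Document frequency: in how many sequences does each k-mer occur?
    doc_freq = Counter(km for doc in docs for km in doc)
    features = []
    for doc in docs:
        total = sum(doc.values())
        features.append({
            # Term frequency times (unsmoothed) inverse document frequency.
            km: (count / total) * math.log(n_docs / doc_freq[km])
            for km, count in doc.items()
        })
    return features


# Hypothetical short sequences, not real genomes.
toy_seqs = ["ACGTAC", "ACGTTT", "GGGACG"]
toy_features = tfidf_kmer_features(toy_seqs, k=3)
```

A k-mer shared by every sequence (here `ACG`) gets weight zero, while k-mers unique to one sequence get the largest weights, which is the property that can help separate rare variants.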

19
Genomic indicators of gene function: A systematic assessment of the human genome

Cooper, H. B.; Rojas Lopez, K. E.; Schiavinato, D.; Black, M. A.; Gardner, P. P.

2026-04-09 genomics 10.64898/2026.04.08.717348 medRxiv
Top 1%
0.5%

Proteins and non-coding RNAs are functional products of the genome that are central to crucial cellular processes. With recent technological advances, researchers can sequence genomes in the thousands and probe numerous genomic activities across many species and conditions. Such studies have identified thousands of potential proteins, RNAs and associated activities. However, there are conflicting interpretations of the results, and therefore of which regions of the genome are "functional". Here we investigate the relative strengths of associations between coding and non-coding gene functionality and genomic features, by comparing reliably annotated functional genes to non-genic regions of the genome. We find that the genomic features most strongly and consistently associated with functional genes are transcriptional activity and evolutionary conservation. We also evaluated sequence-based statistics, genomic repeats, epigenetic and population variation data. Other features strongly associated with function include histone marks, chromatin accessibility, genomic copy-number, and sequence alignment statistics such as coding potential and covariation. We also identify potential issues with SNP annotations in short non-coding RNAs, as some highly conserved ncRNAs have significantly higher than expected SNP densities. Our results demonstrate the importance of evolutionary conservation and transcriptional activity for indicating protein-coding and non-coding gene function. Both should be taken into consideration when differentiating between functional sequences and biological or experimental noise.

20
Cancer Variant Interpretation Group UK (CanVIG-UK): updates on an exemplar national subspecialty multidisciplinary network

Garrett, A.; Allen, S.; Rowlands, C. F.; Choi, S.; Durkie, M.; Burghel, G. J.; Robinson, R.; Callaway, A.; Field, J.; Frugtniet, B.; Palmer-Smith, S.; Grant, J.; Pagan, J.; McDevitt, T.; Hughes, L.; Johnston, E.; Yarram-Smith, L.; Logan, P.; Reed, L.; Snape, K.; Hanson, H.; McVeigh, T. P.; Turnbull, C.; CanVIG,

2026-03-19 genetic and genomic medicine 10.64898/2026.03.17.26348157 medRxiv
Top 1%
0.4%

Cancer Variant Interpretation Group UK was established in 2017 in response to the publication of the 2015 ACMG/AMP v3 guidance for the interpretation of sequence variants. Its initial purpose was to ensure consistency in the UK clinical-laboratory community implementation of ACMG/AMP v3 guidance for cancer susceptibility genes (CSGs). Still convening for monthly national meetings, the remit of CanVIG-UK now encompasses additional activities delivered under the following objectives:
- Creation of a national multidisciplinary professional network and regular forum.
- Delivery of training and education.
- Establishment of a consensus approach to the fundamentals of variant interpretation in cancer susceptibility genes.
- Development and ratification of gene-specific frameworks for variant interpretation for cancer susceptibility genes.
- Development and maintenance of an online platform to facilitate information sharing and variant interpretation within the UK clinical-laboratory community.
- Facilitation of UK contribution to international variant interpretation endeavours.
A survey of CanVIG-UK members, conducted in November 2025 to evaluate the impact of these activities, had 163 responses, including 113 clinical scientists/trainees and 27 Clinical Genetics consultants/trainees. The utility of the CanVIG-UK consensus recommendations for variant interpretation in cancer susceptibility genes was highly rated, with 89/145 (61.4%) of survey respondents reporting using the guidance at least weekly (≥4 times/month) and 124/128 (96.9%) rating it as extremely/very useful. The usage frequency and utility of the gene-specific guidance reported by survey respondents were similar to those reported for the main consensus specification. Both qualitative and quantitative survey responses clearly demonstrate the value of the CanVIG-UK activities to the clinical-diagnostic community.
Key messages
- What is already known on this topic: Cancer Variant Interpretation Group UK (CanVIG-UK) is a national subspecialty multidisciplinary network first established in 2017. It brings together members of the UK clinical-laboratory community to improve accuracy and consistency in the interpretation of variants in cancer susceptibility genes (CSGs).
- What this study adds: this article presents the results of a survey of CanVIG-UK members, demonstrating the impact of CanVIG-UK activities on their services, as well as a review of progress in the six updated objectives of CanVIG-UK.
- How this study might affect research, practice or policy: this article presents current priorities and practices and potential future directions for variant interpretation in CSGs across the UK and Republic of Ireland.